HOW DOES TEAM RANDOM FOREST SERVICE PREDICT HOME PRICES FOR NASHVILLE?

Hedonic Prediction & Feature Engineering

I. Introduction
Project Purpose

To predict sale prices of properties in the city of Nashville, Tennessee.

Who Cares?

Sale price prediction is a critical area of data science. It is important for buyers, industry professionals, and government. Accurate predictive models are beneficial for buyers and sellers. When both parties have access to the same data, then the best deal for both parties can be negotiated fairly. Price forecasting supports municipalities as a way to levy consistent and fair property tax rates.

We are Team Random Forest Service, and for us, this RStudio exercise started like all the rest: in a matrix of confusion, metaphorically speaking. But we were especially interested in this project for the opportunity to investigate the relationships among the many factors correlated with the cost of real estate in Nashville.

Why is this hard?

This task is challenging for many reasons. Urban places are inherently complex, and algorithm-building, or predictive modelling, is outside the typical (or traditional) practice of social science. This is a practical exercise, not a theoretical one, and it requires a different set of problem-solving techniques and a wide variety of metrics to determine both accuracy and generalizability. Because cities are complex and host a wide variety of work and actors, data exists at many scales and units, with varying levels of nuance and completeness. Logistically, project management (time and skill management), building a workflow, and wrangling data are time-consuming. Our team had difficulty figuring out how to account for and represent local clustering, or comparable nearby prices (‘comps’). For all of these reasons this project posed a challenge, and its completion yielded a commensurate sense of satisfaction (we both learned a lot through this process).

What’s our Strategy?

We started with what we know: open RStudio, load all the appropriate packages, and set the shared working directory. Then, talk it out. And the fun began: discussing “what if” scenarios, finding data, coercing data into the correct format, and hacking the model until we found the selection of variables that yielded the most accurate model. Put another way, our strategy is best described as iterative trial and error (and trial).

What did we find?

Home prices are spatially correlated, such that similar prices tend to cluster together. On average, our model’s predictions miss the actual sale price by approximately 48%. The predictive power of our model varies: while some predictions are accurate, others (in particular those for expensive properties) are less accurate, which indicates a lack of generalizability.

II. DATA

We are practicing feature engineering, in which we take a variable and recode it into a strong predictor of sale price. We attempted to select variables that were appropriately divided to avoid collinearity (assumption: overloading the model with variables might cause muddling of predictions).
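
As an illustration of this kind of recoding, here is a minimal sketch in Python (not the R used for this project) that turns an effective year built into 0/1 indicator variables. The bin names mirror the Built_* dummies that appear in the regression output; the exact cut points, including which bin the year 1920 falls into, are assumptions.

```python
# Hypothetical bin edges, named after the Built_* dummies in the model output.
# Treating 1920 itself as "before 1920" is an assumption.
BINS = [
    ("Built_before_1920", 0, 1920),
    ("Built_1921_1950", 1921, 1950),
    ("Built_1951_1960", 1951, 1960),
    ("Built_1961_1970", 1961, 1970),
    ("Built_1971_1980", 1971, 1980),
    ("Built_1981_1990", 1981, 1990),
    ("Built_1991_2000", 1991, 2000),
    ("Built_2001_2010", 2001, 2010),
    ("Built_2011_2018", 2011, 2018),
]


def year_built_dummies(year):
    """Return one 0/1 indicator per bin; at most one indicator is 1."""
    return {name: int(low <= year <= high) for name, low, high in BINS}


example = year_built_dummies(1995)
```

In the regression output, the two earliest bins are excluded from the formula, so they act as the reference category against which the remaining decade coefficients are read.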

Primary Research Question: What impacts Sale Price?

What we thought?

- There is a statistically significant, qualitative difference between subsets of variables, such as decades of effective year built.
- Generally speaking, we believe that nearer things (spatial relationships) are important. These spatial relationships impact home prices in many different ways.
- Urban and local knowledge may have a greater impact than census or interior feature information. Being both homeowners and urban dwellers, we felt we could speak to generalizable knowledge about urban living: urban buyers are willing to pay for convenience. To capture local knowledge, we sourced information about the perceived neighborhood quality and amenities in the urban neighborhoods under study.

We reviewed provided data and gathered additional data from Google My Maps, the Nashville City Planning and government websites, and ArcGIS online, to name a few.

Summary statistics

Description of Variables:

- Shootings
- Percent vacant: from census information
- Percent renter: from census information
- Percent white: from census information
- Percent tree cover (tree canopy)
- Distance to dumping: from density of 311 calls
- Distance to highway exits
- Land cover type: divided into forest, developed, open, medium-density, low-density, high-density, and pasture
- Within 500’ of rehabs: density of building permit applications
- Within 500’ of new homes: density of building permit applications
- Within 500’ of additions: density of building permit applications
- Within 500’ of bars
- Within 500’ of alcohol carryouts
- Within 1000’ of AirBnB rentals
- Distance to MTA bus lines (public transit)
- Distance to Business Improvement Districts (high density)
- Distance to parks
- Distance to “anti-amenities” (fried chicken restaurants, adult entertainment, and Walmarts)
- Distance to amenities (markets, grocery stores, and coffee shops)
- School district
- Neighborhood Assessor
- Acreage
- Building type
- Fixtures
- Foundation
- Story height
- Exterior wall
- Building grade
- Frame
- Effective year built: divided by (estimated) architectural or development periods
- Bedrooms within units
- Gross square feet (sf_sketched in the provided data)
- Physical depreciation
- Number of units
- Baths
- Half baths: we experimented with division of this variable

II.b. DATA: Correlation Matrix

II.c. DATA: Dependent Variable (Sale Price)

II.d. DATA: Independent Variables

II.e. DATA: Scatterplots

III. METHODS

Our first model was a “kitchen sink” model that allowed us to observe the statistical significance of each variable. From that baseline understanding of variable significance, we partitioned the data into a 25% test set and a 75% training set.
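
A random 25/75 partition can be sketched as follows. This is an illustrative Python version, not the R code used for the project; the function name, the seed, and the stand-in list of sales are all assumptions.

```python
import random


def train_test_split(rows, test_share=0.25, seed=42):
    """Shuffle a copy of the rows and hold out the first test_share of them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the sales records: 1,000 hypothetical row identifiers.
sales = list(range(1000))
train, test = train_test_split(sales)
```

Fixing the seed makes the split reproducible, so model variants can be compared on the same held-out sales.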

We subdivided qualitative variables (from provided data on interior characteristics and neighborhood amenities) into chunks and used trial and error to achieve a model with a relatively accurate but also generalizable prediction. We tested the accuracy of the model and determined that it was not performing much better than a coin toss. We then incorporated local spatial autocorrelation, believing that this variable could have the greatest impact on prices. We computed the average sale price of each property’s 3 and 5 nearest neighbors and used the 5-nearest-neighbor variable, which represents a larger neighborhood unit.
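
The nearest-neighbor price feature can be sketched as below, in Python rather than R. The sketch assumes planar coordinates and excludes each sale’s own price from its average; the source does not state how the team handled either point, and the function name and toy data are hypothetical.

```python
import math


def avg_price_knn(sales, k=5):
    """For each (x, y, price) sale, average the prices of its k nearest other sales.

    Assumes at least two sales. Brute force, so only suitable for small samples.
    """
    averages = []
    for i, (x, y, _) in enumerate(sales):
        others = [
            (math.hypot(x - ox, y - oy), price)
            for j, (ox, oy, price) in enumerate(sales)
            if j != i  # leave out the sale's own price
        ]
        others.sort(key=lambda pair: pair[0])
        nearest = others[:k]
        averages.append(sum(price for _, price in nearest) / len(nearest))
    return averages


# Toy data: three nearby sales and one distant, expensive one.
toy_sales = [(0, 0, 100000), (1, 0, 200000), (0, 1, 300000), (10, 10, 900000)]
toy_avg = avg_price_knn(toy_sales, k=2)
```

Leaving out the sale’s own price matters: including it would let the dependent variable leak into its own predictor.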

Our focus was on a workflow that would allow us to select variables directly in the regression model, test the selected variables, and adjust as necessary.

When we first ran the model, we were not able to predict the high-value properties.

We experimented with K-means clustering, with which we were able to isolate areas of similar price and exclude the higher-cost areas at will. Moderately priced properties appear to get a predicted price bump in areas with many AirBnB rentals.

Results of K-Means Clustering in Feature-Space

Description of Centers:

##   rentals_1000 avg_Price_5
## 1    12.566667   2717309.5
## 2    10.830071    509273.8
## 3    10.675835   1009992.9
## 4     4.553303    198344.5

Size of Centers:

## [1]   30 2401  509 7054
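
For readers unfamiliar with the technique, here is a minimal sketch of K-means (Lloyd’s algorithm) on two features, written in Python rather than R. The toy points, starting centers, and function names are assumptions for illustration and are not the project data.

```python
import math


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        for p in points
    ]


def kmeans(points, centers, iterations=20):
    """Alternate between assigning points and moving centers to cluster means."""
    centers = list(centers)
    for _ in range(iterations):
        labels = assign(points, centers)
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assign(points, centers)


# Toy (rentals_1000, avg_Price_5) pairs.
points = [(4, 190000), (5, 200000), (5, 210000),
          (11, 500000), (10, 520000), (12, 2700000)]
centers, labels = kmeans(points, [points[0], points[3], points[5]])
```

Because Euclidean distance is computed on raw values here, the price column dominates the clustering; whether the features were rescaled in the original analysis is not stated.
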

IV. RESULTS

IV.a. RESULTS: In-sample (training set) model results
## 
## Call:
## lm(formula = log(SalePrice) ~ ., data = df1 %>% select(-test, 
##     -longitude, -latitude, -kenID, -LandUseFul, -Exterior_W, 
##     -Acrage_1, -Phys_Depre, -Foundation, -Fixtures, -NbrhdAssr, 
##     -bedroomsun, -BLdgGrade, -sf_Gross, -landcover, -rehabs_500, 
##     -bedroomsun_1, -halfbaths_1, -halfbaths_12, -halfbaths_13, 
##     -halfbaths_14, -baths_1, -Park_name_1, -addition_500, -landcover, 
##     -effyearbui_1, -Pasture, -NumofUnits_1, -CensusBloc, -Neighborho, 
##     -roomsunits, -sf_bsmt, -sf_sketche, -Zone_Asses, -shootings, 
##     -Bldg_Type, -Dist_AntiAmenity, -pct_vacnt, -pct_renter, -Forest, 
##     -Dev_Open, -Dev_Med, -Dev_Low, -Dev_High, -Built_before_1920, 
##     -Built_1921_1950, -Avg_Price_3))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6758 -0.2043  0.0682  0.2770  2.8293 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              3.957e+01  4.770e+00   8.295  < 2e-16 ***
## LocationZi              -7.460e-04  1.280e-04  -5.829 5.79e-09 ***
## CouncilDis               4.992e-03  7.608e-04   6.562 5.63e-11 ***
## Acrage                   1.132e-01  2.125e-02   5.324 1.04e-07 ***
## Story_Heig1 STORY       -1.930e+00  4.904e-01  -3.936 8.34e-05 ***
## Story_Heig1.25 STORY    -1.782e+00  4.954e-01  -3.597 0.000324 ***
## Story_Heig1.5 STORY     -1.839e+00  4.910e-01  -3.746 0.000181 ***
## Story_Heig1.75 STORY    -1.839e+00  4.912e-01  -3.745 0.000182 ***
## Story_Heig2 STORY       -1.940e+00  4.906e-01  -3.955 7.70e-05 ***
## Story_Heig2.25 STORY    -2.033e+00  5.008e-01  -4.060 4.96e-05 ***
## Story_Heig2.5 STORY     -2.021e+00  4.988e-01  -4.052 5.12e-05 ***
## Story_Heig2.75 STORY    -2.049e+00  5.057e-01  -4.052 5.13e-05 ***
## Story_Heig3 STORY       -1.804e+00  4.915e-01  -3.671 0.000243 ***
## Story_Heig4 STORY       -2.266e+00  5.664e-01  -4.001 6.36e-05 ***
## Story_HeigBI-LEVEL      -1.878e+00  4.938e-01  -3.802 0.000144 ***
## Story_HeigCOM 3 STY     -1.968e+00  6.929e-01  -2.841 0.004512 ** 
## Story_HeigCOM 4 STY     -1.435e+00  6.934e-01  -2.069 0.038538 *  
## Story_HeigSPLIT LEVEL   -1.865e+00  4.925e-01  -3.787 0.000153 ***
## sf_finishe               2.557e-04  9.481e-06  26.965  < 2e-16 ***
## sf_bsmt_fi              -5.849e-05  2.663e-05  -2.196 0.028090 *  
## pct_tree                -2.780e-04  2.871e-04  -0.968 0.332946    
## dist_dumping             1.084e-04  1.861e-05   5.827 5.84e-09 ***
## dist_hwyext              3.759e-05  5.567e-06   6.751 1.56e-11 ***
## newhomes_500            -2.436e-03  6.023e-03  -0.404 0.685885    
## bars_500                 1.987e-02  7.863e-03   2.527 0.011510 *  
## carryout_500            -4.618e-02  1.962e-02  -2.354 0.018580 *  
## rentals_1000             5.411e-03  6.781e-04   7.979 1.67e-15 ***
## Frame_1RESD FRAME        1.586e-01  2.460e-02   6.448 1.20e-10 ***
## Frame_1TYPICAL           2.020e-01  2.107e-02   9.587  < 2e-16 ***
## Phys_Depre_1Average      1.273e+00  5.007e-01   2.543 0.011012 *  
## Phys_Depre_1Dilapidated  7.661e-01  5.575e-01   1.374 0.169441    
## Phys_Depre_1Excellent    1.606e+00  6.091e-01   2.636 0.008401 ** 
## Phys_Depre_1Fair         1.145e+00  5.029e-01   2.276 0.022870 *  
## Phys_Depre_1Good         1.425e+00  5.058e-01   2.817 0.004852 ** 
## Phys_Depre_1Poor         9.491e-01  5.097e-01   1.862 0.062655 .  
## Phys_Depre_1Very Good    1.578e+00  5.394e-01   2.926 0.003442 ** 
## Phys_Depre_1Very Poor    1.119e+00  5.269e-01   2.124 0.033709 *  
## Dist_BusLn               6.916e+00  1.656e+00   4.175 3.00e-05 ***
## Dist_BID                -2.493e+00  3.612e-01  -6.903 5.46e-12 ***
## Dist_Amenity            -3.422e+00  4.945e-01  -6.919 4.87e-12 ***
## school_dis               2.060e-03  2.772e-03   0.743 0.457507    
## pct_white                3.286e-01  2.548e-02  12.896  < 2e-16 ***
## avg_Price_5              8.679e-07  2.691e-08  32.250  < 2e-16 ***
## Built_1951_1960         -1.613e-02  3.130e-02  -0.516 0.606189    
## Built_1961_1970          7.509e-03  3.010e-02   0.249 0.803039    
## Built_1971_1980          5.789e-03  2.963e-02   0.195 0.845107    
## Built_1981_1990          5.517e-02  2.738e-02   2.015 0.043899 *  
## Built_1991_2000          1.241e-01  2.750e-02   4.512 6.52e-06 ***
## Built_2001_2010          1.853e-01  2.725e-02   6.797 1.14e-11 ***
## Built_2011_2018          3.254e-01  2.781e-02  11.699  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.489 on 8375 degrees of freedom
## Multiple R-squared:  0.5696, Adjusted R-squared:  0.5671 
## F-statistic: 226.2 on 49 and 8375 DF,  p-value: < 2.2e-16
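
Because the dependent variable in this model is log(SalePrice), a coefficient is read as a proportional change in price rather than a dollar amount. The short Python sketch below converts two of the reported coefficients; the helper name is hypothetical, and the interpretation holds the other predictors fixed.

```python
import math


def pct_effect(beta):
    """Percent change in price implied by a log-model coefficient."""
    return (math.exp(beta) - 1) * 100


# Built_2011_2018 coefficient, relative to the omitted earliest year-built bins.
built_2011_2018 = pct_effect(3.254e-01)      # about 38.5

# sf_finishe coefficient, scaled to 100 additional finished square feet.
sf_finishe_100 = pct_effect(2.557e-04 * 100)  # about 2.6
```

For small coefficients the percent change is close to 100 times the coefficient itself, which is why the square-footage effect is nearly the raw value.
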
IV.b. RESULTS: R^2, mean absolute error, and MAPE for test set
## 
## Call:
## lm(formula = (SalePrice) ~ ., data = df1 %>% select(-test, -longitude, 
##     -latitude, -kenID, -LandUseFul, -Exterior_W, -Acrage_1, -Phys_Depre, 
##     -Foundation, -Fixtures, -NbrhdAssr, -bedroomsun, -BLdgGrade, 
##     -sf_Gross, -landcover, -rehabs_500, -bedroomsun_1, -halfbaths_1, 
##     -halfbaths_12, -halfbaths_13, -halfbaths_14, -baths_1, -Park_name_1, 
##     -addition_500, -landcover, -effyearbui_1, -Pasture, -NumofUnits_1, 
##     -CensusBloc, -Neighborho, -roomsunits, -sf_bsmt, -sf_sketche, 
##     -Zone_Asses, -shootings, -Bldg_Type, -Dist_AntiAmenity, -pct_vacnt, 
##     -pct_renter, -Forest, -Dev_Open, -Dev_Med, -Dev_Low, -Dev_High, 
##     -Built_before_1920, -Built_1921_1950, -Avg_Price_3))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2403266   -60046    -1762    47127  5614851 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.994e+06  2.104e+06   0.948 0.343366    
## LocationZi              -5.059e+01  5.645e+01  -0.896 0.370212    
## CouncilDis               1.356e+03  3.356e+02   4.042 5.35e-05 ***
## Acrage                   2.024e+04  9.374e+03   2.159 0.030879 *  
## Story_Heig1 STORY       -8.465e+05  2.163e+05  -3.913 9.18e-05 ***
## Story_Heig1.25 STORY    -8.247e+05  2.185e+05  -3.774 0.000162 ***
## Story_Heig1.5 STORY     -8.500e+05  2.166e+05  -3.925 8.76e-05 ***
## Story_Heig1.75 STORY    -8.272e+05  2.167e+05  -3.817 0.000136 ***
## Story_Heig2 STORY       -8.643e+05  2.164e+05  -3.994 6.55e-05 ***
## Story_Heig2.25 STORY    -8.854e+05  2.209e+05  -4.008 6.18e-05 ***
## Story_Heig2.5 STORY     -9.228e+05  2.200e+05  -4.194 2.77e-05 ***
## Story_Heig2.75 STORY    -8.058e+05  2.231e+05  -3.613 0.000305 ***
## Story_Heig3 STORY       -7.971e+05  2.168e+05  -3.677 0.000238 ***
## Story_Heig4 STORY       -1.072e+06  2.498e+05  -4.289 1.81e-05 ***
## Story_HeigBI-LEVEL      -8.897e+05  2.178e+05  -4.085 4.45e-05 ***
## Story_HeigCOM 3 STY     -9.844e+05  3.056e+05  -3.221 0.001282 ** 
## Story_HeigCOM 4 STY     -7.099e+05  3.059e+05  -2.321 0.020322 *  
## Story_HeigSPLIT LEVEL   -8.780e+05  2.172e+05  -4.041 5.36e-05 ***
## sf_finishe               1.041e+02  4.182e+00  24.897  < 2e-16 ***
## sf_bsmt_fi              -4.126e+01  1.175e+01  -3.513 0.000446 ***
## pct_tree                 8.502e+01  1.266e+02   0.671 0.502006    
## dist_dumping             8.934e+00  8.208e+00   1.089 0.276378    
## dist_hwyext              1.752e+00  2.456e+00   0.713 0.475617    
## newhomes_500            -8.061e+03  2.657e+03  -3.034 0.002419 ** 
## bars_500                 1.975e+04  3.468e+03   5.694 1.28e-08 ***
## carryout_500             3.824e+03  8.653e+03   0.442 0.658577    
## rentals_1000             6.908e+02  2.991e+02   2.309 0.020957 *  
## Frame_1RESD FRAME        6.223e+04  1.085e+04   5.735 1.01e-08 ***
## Frame_1TYPICAL           6.307e+04  9.294e+03   6.786 1.23e-11 ***
## Phys_Depre_1Average      5.836e+05  2.209e+05   2.642 0.008251 ** 
## Phys_Depre_1Dilapidated  5.542e+05  2.459e+05   2.253 0.024256 *  
## Phys_Depre_1Excellent    7.545e+05  2.687e+05   2.808 0.004995 ** 
## Phys_Depre_1Fair         5.646e+05  2.218e+05   2.545 0.010945 *  
## Phys_Depre_1Good         6.527e+05  2.231e+05   2.925 0.003454 ** 
## Phys_Depre_1Poor         5.313e+05  2.249e+05   2.363 0.018164 *  
## Phys_Depre_1Very Good    7.528e+05  2.379e+05   3.164 0.001562 ** 
## Phys_Depre_1Very Poor    5.389e+05  2.324e+05   2.319 0.020446 *  
## Dist_BusLn              -7.800e+05  7.306e+05  -1.068 0.285709    
## Dist_BID                -2.649e+05  1.593e+05  -1.662 0.096497 .  
## Dist_Amenity            -3.414e+05  2.181e+05  -1.565 0.117641    
## school_dis               1.852e+02  1.223e+03   0.151 0.879592    
## pct_white                3.414e+04  1.124e+04   3.038 0.002392 ** 
## avg_Price_5              5.498e-01  1.187e-02  46.314  < 2e-16 ***
## Built_1951_1960         -1.090e+04  1.381e+04  -0.790 0.429807    
## Built_1961_1970         -3.678e+03  1.328e+04  -0.277 0.781816    
## Built_1971_1980          4.332e+03  1.307e+04   0.331 0.740344    
## Built_1981_1990          7.068e+02  1.208e+04   0.059 0.953325    
## Built_1991_2000          1.007e+04  1.213e+04   0.830 0.406379    
## Built_2001_2010          2.443e+04  1.202e+04   2.032 0.042143 *  
## Built_2011_2018          8.993e+04  1.227e+04   7.330 2.51e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 215700 on 8375 degrees of freedom
## Multiple R-squared:  0.5126, Adjusted R-squared:  0.5097 
## F-statistic: 179.7 on 49 and 8375 DF,  p-value: < 2.2e-16
SUMMARY STATISTICS for the Test Set

MAE: 98,438

MAPE: 0.4834 (48.34%)
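
For reference, the two error measures can be sketched as follows in Python. The function names and the three toy sales are assumptions for illustration and are not drawn from the test set.

```python
def mae(observed, predicted):
    """Mean absolute error, in the same units as the prices (dollars)."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mape(observed, predicted):
    """Mean absolute percentage error, as a fraction of the observed price."""
    return sum(abs(o - p) / o for o, p in zip(observed, predicted)) / len(observed)


# Toy sale prices and predictions.
observed = [100000, 200000, 400000]
predicted = [150000, 180000, 300000]
toy_mae = mae(observed, predicted)
toy_mape = mape(observed, predicted)
```

MAPE scales each miss by the observed price, so the same dollar error counts for more on an inexpensive home than on an expensive one.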

IV.c. RESULTS: Cross-validation tests on the training set (mean & standard deviation R^2)

RMSE: 193,968.1

R-squared: 0.5868058

MAE: 100,110.5

Cross-validation R^2 as a histogram

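The model does not appear to be overfit: the cross-validation MAE (100,110.5) is close to the test-set MAE (98,438). The 48% figure is the test-set MAPE, an error rate rather than an accuracy rating: on average, predictions miss the observed sale price by about 48%.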
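
The idea behind the cross-validation test can be sketched as below: fit on all folds but one, score on the held-out fold, and summarize the spread of scores. This Python version uses a one-variable regression on synthetic data; the fold count, seed, data, and function names are assumptions, and it is not the procedure that produced the figures above.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


def r_squared(ys, preds):
    my = statistics.fmean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def cross_validate(xs, ys, k=10, seed=1):
    """Return the mean and standard deviation of held-out R^2 over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    scores = []
    for fold in (idx[i::k] for i in range(k)):
        held = set(fold)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [a + b * xs[i] for i in fold]
        scores.append(r_squared([ys[i] for i in fold], preds))
    return statistics.fmean(scores), statistics.stdev(scores)


# Synthetic data: price rises with square footage, plus noise.
rng = random.Random(0)
sqft = [rng.uniform(800, 4000) for _ in range(200)]
price = [50000 + 100 * s + rng.gauss(0, 40000) for s in sqft]
mean_r2, sd_r2 = cross_validate(sqft, price)
```

A small standard deviation across folds suggests that the fit does not depend heavily on which sales happen to be held out.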

IV.d. RESULTS: Predicted prices as a function of observed

IV.e. RESULTS: Residuals Map for 25% of Test Set & Moran’s test

## 
##  Moran I test under randomisation
## 
## data:  residTest$residuals  
## weights: nb2listw(spatialWeights, style = "W")    
## 
## Moran I statistic standard deviate = 5.7461, p-value = 4.567e-09
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.0901261043     -0.0005885815      0.0002492363
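
A positive and significant Moran’s I on the residuals indicates that prediction errors are still spatially clustered. The Python sketch below shows how the statistic is computed with row-standardized weights and checks the reported standard deviate against the reported estimates. The neighbor lists and toy values are assumptions, and the function is an illustration, not the spdep routine that produced the output above.

```python
import math


def morans_i(values, neighbors):
    """Moran's I with row-standardized weights.

    neighbors[i] lists the indices treated as neighbors of observation i.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Each neighbor of i gets weight 1 / len(neighbors[i]).
    cross = sum(
        dev[i] * dev[j] / len(nbrs)
        for i, nbrs in enumerate(neighbors)
        for j in nbrs
    )
    total_weight = sum(1 for nbrs in neighbors if nbrs)  # each non-empty row sums to 1
    return (n / total_weight) * cross / sum(d * d for d in dev)


# Six observations along a line; each is a neighbor of the ones beside it.
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
clustered = morans_i([10, 12, 11, -9, -11, -10], chain)   # similar values adjacent
alternating = morans_i([1, -1, 1, -1], chain[:3] + [[2]])  # opposite values adjacent

# Standard deviate recomputed from the reported statistic, expectation, and variance.
z = (0.0901261043 - (-0.0005885815)) / math.sqrt(0.0002492363)
```

The recomputed deviate matches the 5.7461 in the test output, which is how the p-value under the "greater" alternative is obtained.
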
IV.f. RESULTS: Predicted Values of the Entire Dataset

IV.g. RESULTS: Predicted Values for all Sales

IV.h. RESULTS: Mean absolute percentage error (MAPE) by zip code

IV.i. RESULTS: Mean absolute percentage error (MAPE) by zip code as a function of mean price by neighborhood

V. DISCUSSION

The major flaw in our model is that it fails at predicting the most expensive sale prices. The low accuracy of predictions in higher-price areas is the result of either omitting variables or not adequately capturing them. The most impactful interior characteristic was gross square footage, but amenity and neighborhood characteristics generally had a greater impact on model accuracy than interior characteristics, which supports our hypothesis that urban buyers are motivated by proximity to amenities. Neighborhood price values (or comps, represented by the 5 nearest neighbors) significantly impacted sale price.

We divided factors to avoid collinearity and did not insert duplicate divisions, i.e., both 3 and 5 nearest neighbor sale prices (assumption: overloading the model with variables might cause muddling of predictions). However, after discussion with other groups about their results and approach, it appears that the nearest 3, 5, and 10 neighbors could all have been input to the model to improve accuracy. Generally, our hypotheses were confirmed; however, we learned that “local knowledge” is best represented by spatial neighborhood characteristics and sale comps.

VI. CONCLUSION

We would not recommend this model to Zillow. This was a difficult project, but we learned a lot. With more time, we are confident that we could drastically improve our predictive power with the lessons learned in the last two weeks. In the future, we could focus on creating dummy variables that divide variables to capture qualitative differences; for example, the overall number of stories isn’t as significant as whether a home has 3 stories. Binning variables such as effective year built would improve the accuracy of predictions. Refining the spatial autocorrelation variable is another improvement that would directly impact the predictions for higher prices. Lastly, for this particular project, we assumed that the effects of variables were the same globally, throughout the entire city; in the future, using other types of regression (rather than OLS) would likely yield a better model.